From Bilingual Dictionaries to Interlingual Document Representations

نویسندگان

  • Jagadeesh Jagarlamudi
  • Hal Daumé
  • Raghavendra Udupa
چکیده

Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Bilingual Projections via Sparse Covariance Matrices

Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads t...

متن کامل

An Approach to Development of Bilingual Lexical Resources

This paper outlines how Bibliša, a tool initially designed for search of digital libraries of articles from bilingual e-journals in the form of TMX documents, is used for development of a new bilingual lexical resource. The approach relies on already available resources, Serbian morphological e-dictionaries, Serbian and English wordnets connected via the interlingual index, and a bilingual Dict...

متن کامل

Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links

This paper presents a new research and development project called Papillon. It started as a French-Japanese cooperation between laboratories GETA/CLIPS (Grenoble, France) and NII (Tokyo, Japan). Its goal is to build a multilingual lexical database and to extract from it digital bilingual dictionaries. The database is built with monolingual dictionaries, one for each language of the database, li...

متن کامل

Monolingual and bilingual dictionary approaches to the enrichment of the Spanish WordNet with adjectives

We report on two different approaches to the incorporation of adjectives in Spanish WordNet based on automatic extraction techniques using EuroWordNet and machine-readable dictionaries. We show that a monolingual dictionary approach enables to exploit relations between different parts of speech and enrich the internal structure of the Spanish WordNet, while the methods based on bilingual dictio...

متن کامل

Using Wikipedia for Named-Entity Translation

In this paper we present a system for translating named-entities from Basque to English using Wikipedia’s knowledge. We can exploit interlingual links from Wikipedia (WIL) to get named-entity translation, but entities without interlingual links can be translated using the Wikipedia as a corpus, suggesting new interlingual links. In this second case the interlingual links can be used as a test c...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011